
    Efficiency of propensity score adjustment and calibration on the estimation from non-probabilistic online surveys

    One of the main sources of inaccuracy in modern survey techniques, such as online and smartphone surveys, is the absence of an adequate sampling frame that could provide probabilistic sampling. This kind of data collection introduces substantial bias into the final estimates of the survey, especially if the estimated variables (also known as target variables) influence the respondent's decision to participate in the survey. Correction techniques such as calibration and Propensity Score Adjustment (PSA) can be applied to remove the bias. This study analyses the efficiency of these correction techniques in multiple situations, applying a combination of propensity score adjustment and calibration on both types of variables (correlated and not correlated with the missing-data mechanism) and testing the use of a reference survey to obtain the population totals for the calibration variables. The study was performed using a simulation of a fictitious population of potential voters and a real volunteer survey aimed at a population for which a complete census was available. The results showed that PSA combined with calibration removes considerably more bias than calibration with no prior adjustment. They also showed that using population totals estimated from a reference survey instead of the available population data makes no difference to the accuracy of the estimates, although it can slightly increase the variance of the estimator.
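    The pipeline described above — estimate participation propensities against a reference sample, convert them to weights, then calibrate — can be sketched in a few lines. This is a minimal illustration on simulated data, not the paper's code: the toy samples, the hand-rolled gradient-descent logistic model and the simple ratio calibration to an assumed population size N are all assumptions of the sketch (real calibration would match weighted totals of several auxiliary variables).

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression (a stand-in for any
    propensity model; the first coefficient is the intercept)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ beta))
        beta += lr * Xb.T @ (y - p) / len(y)
    return beta

# Pooled sample: reference (probability) units coded 0, volunteers coded 1.
x_ref = rng.normal(0.0, 1.0, 500)   # covariate in the reference sample
x_vol = rng.normal(0.8, 1.0, 300)   # volunteers over-represent high x
X = np.concatenate([x_ref, x_vol])[:, None]
z = np.concatenate([np.zeros(500), np.ones(300)])

beta = fit_logistic(X, z)
p_vol = 1 / (1 + np.exp(-(beta[0] + beta[1] * x_vol)))  # volunteer propensities
w_psa = (1 - p_vol) / p_vol                             # inverse-propensity weights

# Calibration step (simplest possible form): scale the PSA weights so they
# sum to an assumed known population size N.
N = 10_000
w_cal = w_psa * N / w_psa.sum()
```

    Applying calibration after PSA, rather than calibration alone, is what the abstract reports as the larger bias reduction.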

    Propensity score adjustment using machine learning classification algorithms to control selection bias in online surveys

    Modern survey methods may be subject to non-observable bias from various sources. Among online surveys, for example, selection bias is prevalent due to the sampling mechanism commonly used, whereby participants self-select from a subgroup whose characteristics differ from those of the target population. Several techniques have been proposed to tackle this issue. One such technique is Propensity Score Adjustment (PSA), which is widely used and has been analysed in various studies. The usual method of estimating the propensity score is logistic regression, which requires a reference probability sample in addition to the online nonprobability sample. The predicted propensities can be used for reweighting via various estimators. However, in the online survey context, there are alternatives that might outperform logistic regression for propensity estimation. The aim of the present study is to determine the efficiency of some of these alternatives, involving Machine Learning (ML) classification algorithms. PSA is applied in two simulation scenarios, representing situations commonly found in online surveys, using logistic regression and ML models for propensity estimation. The results obtained show that ML algorithms remove selection bias more effectively than logistic regression when used for PSA, but that their efficacy depends largely on the selection mechanism employed and the dimensionality of the data. This study was partially supported by the Ministerio de Economía y Competitividad, Spain [grant number MTM2015-63609-R] and, for the first author, an FPU grant from the Ministerio de Ciencia, Innovación y Universidades, Spain. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
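    As an illustration of swapping an ML classifier into the PSA step, the sketch below estimates propensities with a k-nearest-neighbours rule (the share of volunteer units among the k closest pooled units) instead of logistic regression. The simulated data, the choice of k and the propensity trimming are assumptions for the sketch, not the study's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_propensity(X_pool, z, X_query, k=25):
    """k-nearest-neighbour propensity estimate for 1-D covariates: the
    share of volunteer (z == 1) units among the k closest pooled units.
    A simple ML alternative to logistic regression for PSA."""
    d = np.abs(X_pool[None, :] - X_query[:, None])  # pairwise distances
    idx = np.argsort(d, axis=1)[:, :k]              # k nearest neighbours
    return z[idx].mean(axis=1)

# Pooled sample: reference units (z = 0) and volunteers (z = 1).
x_ref = rng.normal(0.0, 1.0, 400)
x_vol = rng.normal(1.0, 1.0, 400)
X = np.concatenate([x_ref, x_vol])
z = np.concatenate([np.zeros(400), np.ones(400)])

pi = knn_propensity(X, z, x_vol)     # propensities for the volunteer units
pi = np.clip(pi, 0.02, 0.98)         # trim to avoid extreme weights
w = (1 - pi) / pi                    # inverse-propensity weights
```

    The same reweighting step applies regardless of the classifier; only the propensity model changes.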

    Variable selection in Propensity Score Adjustment to mitigate selection bias in online surveys

    The development of new survey data collection methods such as online surveys has been particularly advantageous for social studies in terms of reduced costs, immediacy and enhanced questionnaire possibilities. However, many such methods are strongly affected by selection bias, leading to unreliable estimates. Calibration and Propensity Score Adjustment (PSA) have been proposed as methods to remove selection bias in online nonprobability surveys. Calibration requires population totals to be known for the auxiliary variables used in the procedure, while PSA estimates the volunteering propensity of an individual using predictive modelling. The variables included in these models must be carefully selected in order to maximise the accuracy of the final estimates. This study presents an application, using synthetic and real data, of variable selection techniques developed for knowledge discovery in data to choose the best subset of variables for propensity estimation. We also compare the performance of PSA using different classification algorithms, after which calibration is applied, and present an application of this methodology in a real-world situation, using it to obtain estimates of population parameters. The results obtained show that variable selection using appropriate methods can provide less biased and more efficient estimates than using all available covariates. Ministerio de Ciencia e Innovación, Spain [Grant No. PID2019-106861RB-I00/AEI/10.13039/501100011033]. FPU grant from the Ministerio de Ciencia, Innovación y Universidades. Funding for open access charge: Universidad de Granada / CBUA, Spain. IMAG-María de Maeztu CEX2020-001105-M/AEI/10.13039/50110001103
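    A filter-style variable selection step of the kind discussed above can be sketched as follows: rank candidate covariates by their association with sample membership and keep only the strongest before fitting the propensity model. The synthetic data and the correlation-based ranking are illustrative assumptions; the paper evaluates purpose-built knowledge-discovery selection methods.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pooled design matrix: 6 covariates, but only the first two actually
# drive participation in the (simulated) volunteer sample.
n = 600
X = rng.normal(size=(n, 6))
logit = 1.2 * X[:, 0] - 0.8 * X[:, 1]
z = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

# Filter-style selection: score each covariate by its absolute
# correlation with the membership indicator z, then keep the top ones.
scores = np.abs(np.array([np.corrcoef(X[:, j], z)[0, 1] for j in range(X.shape[1])]))
keep = np.argsort(scores)[::-1][:2]   # indices of the selected covariates
```

    The propensity model would then be fitted on `X[:, keep]` only, which is the kind of reduced covariate set the abstract reports as producing less biased and more efficient estimates.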

    Inference from Non-Probability Surveys with Statistical Matching and Propensity Score Adjustment Using Modern Prediction Techniques

    We would like to thank the anonymous referees for their remarks and comments, which have improved the presentation and the contents of the paper. Online surveys are increasingly common in social and health studies, as they provide fast and inexpensive results in comparison to traditional ones. However, these surveys often work with biased samples, as data collection is often non-probabilistic because of the lack of internet coverage in certain population groups and the self-selection procedure that many online surveys rely on. Some procedures have been proposed to mitigate the bias, such as Propensity Score Adjustment (PSA) and statistical matching. In PSA, the propensity to participate in a nonprobability survey is estimated using a probability reference survey and then used to obtain weighted estimates. In statistical matching, the nonprobability sample is used to train models that predict the values of the target variable, and the predictions of those models for the probability sample can be used to estimate population values. In this study, both methods are compared using three datasets to simulate pseudopopulations from which nonprobability and probability samples are drawn and used to estimate population parameters. In addition, the study compares the use of linear models and Machine Learning prediction algorithms for propensity estimation in PSA and for predictive modeling in statistical matching. The results show that statistical matching outperforms PSA in terms of bias reduction and Root Mean Square Error (RMSE), and that simpler prediction models, such as linear and k-Nearest Neighbors models, provide better outcomes than bagging algorithms. Ministerio de Economía, Industria y Competitividad, Spain [MTM2015-63609-R]. Ministerio de Ciencia, Innovación y Universidades, Spain [FPU17/0217]
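    The statistical matching idea described above — fit a prediction model on the nonprobability sample, then average its predictions over the probability sample using the design weights — can be sketched as follows. The toy linear data and the least-squares fit stand in for the ML models compared in the study; sample sizes and weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Nonprobability (volunteer) sample: target y and covariate x observed.
x_np = rng.normal(1.0, 1.0, 500)                     # biased towards high x
y_np = 2.0 + 3.0 * x_np + rng.normal(0, 0.5, 500)

# Probability reference sample: covariate and design weights, but no y.
x_pr = rng.normal(0.0, 1.0, 400)
w_pr = np.full(400, 25.0)          # e.g. each unit represents 25 people

# Statistical matching: fit a prediction model on the volunteer sample ...
b1, b0 = np.polyfit(x_np, y_np, 1)                   # slope, intercept

# ... and estimate the population mean from its predictions on the
# probability sample, using the design weights.
y_hat = b0 + b1 * x_pr
est_mean = np.average(y_hat, weights=w_pr)
```

    Because the probability sample has a known design, the bias of the volunteer sample enters only through the fitted model, not through the averaging step.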

    Evaluating Machine Learning methods for estimation in online surveys with superpopulation modeling

    Online surveys, despite their cost and effort advantages, are particularly prone to selection bias due to the differences between the target population and the potentially covered population (the online population). This leads to unreliable estimates from online samples unless further adjustments are applied. Several techniques have arisen in recent years to address this issue, among which superpopulation modeling can be useful in Big Data contexts where censuses are accessible. This technique uses the sample to train a model capturing the behavior of a target variable which is to be estimated, and applies it to the nonsampled individuals to obtain population-level estimates. The modeling step has usually been done with linear regression or LASSO models, but machine learning (ML) algorithms have been pointed out as promising alternatives. In this study we examine the use of these algorithms in the online survey context, in order to evaluate and compare their performance and suitability for the problem. A simulation study shows that ML algorithms can remove volunteering bias more effectively than traditional methods in several scenarios. Ministerio de Economía y Competitividad, Spain. Ministerio de Ciencia, Innovación y Universidades, Spain
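    A minimal superpopulation-modeling sketch, assuming a toy census where the covariate is known for every population unit: fit a model on the (biased) volunteer sample, predict the nonsampled units, and combine observed and predicted values into a population total. The linear model here stands in for the ML algorithms evaluated in the study.

```python
import numpy as np

rng = np.random.default_rng(4)

# Fictitious census: covariate x known for every population unit;
# the target y is only observed for sampled units.
N = 5000
x_pop = rng.normal(size=N)
y_pop = 5.0 + 2.0 * x_pop + rng.normal(0, 1.0, N)

# Volunteer sample: selection favours high x (selection bias).
p_sel = 1 / (1 + np.exp(-(x_pop - 1.0)))
s = rng.random(N) < p_sel

# Superpopulation modelling: train on the sample, predict nonsampled units.
b1, b0 = np.polyfit(x_pop[s], y_pop[s], 1)
y_hat = b0 + b1 * x_pop
total_est = y_pop[s].sum() + y_hat[~s].sum()

# Unadjusted estimate, for contrast: inflates the biased sample mean.
naive = y_pop[s].mean() * N
```

    Because the model conditions on the covariate that drives selection, the model-based total corrects most of the volunteering bias that the naive expansion retains.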

    The R package NonProbEst for estimation in non-probability surveys

    Different inference procedures have been proposed in the literature to correct the selection bias that can be introduced by non-random sampling mechanisms. The R package NonProbEst enables the estimation of parameters using several of these techniques to correct selection bias in non-probability surveys. The mean and the total of the target variable are estimated using Propensity Score Adjustment, calibration, statistical matching, and model-based, model-assisted and model-calibrated techniques. Confidence intervals can also be obtained for each method. Machine learning algorithms can be used to estimate the propensities or to predict the unknown values of the target variable for the non-sampled units. Variance estimation for a given estimator is performed by two different Leave-One-Out jackknife procedures. The functionality of the package is illustrated with example data sets.
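    The Leave-One-Out jackknife idea mentioned above can be illustrated generically (in Python rather than R, and not the package's exact procedures): recompute the estimator with each unit deleted and combine the replicates into a variance estimate.

```python
import numpy as np

rng = np.random.default_rng(5)

def jackknife_variance(y, w, estimator):
    """Leave-one-out jackknife: recompute the estimator with each unit
    deleted and combine the replicates. A generic sketch, not the exact
    NonProbEst implementation."""
    n = len(y)
    reps = np.array([estimator(np.delete(y, i), np.delete(w, i))
                     for i in range(n)])
    return (n - 1) / n * ((reps - reps.mean()) ** 2).sum()

# Example: variance of a weighted mean under illustrative toy data.
wmean = lambda y, w: np.average(y, weights=w)
y = rng.normal(10, 2, 200)
w = rng.uniform(0.5, 1.5, 200)
var_hat = jackknife_variance(y, w, wmean)
```

    The same replicate scheme applies to any of the package's estimators, since the jackknife only needs to be able to recompute the point estimate on a reduced sample.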

    Self-Perceived Health, Life Satisfaction and Related Factors among Healthcare Professionals and the General Population: Analysis of an Online Survey, with Propensity Score Adjustment

    Healthcare professionals (HCPs) often suffer high levels of depression, stress, anxiety and burnout. Our main study aims were to estimate the prevalences of poor self-perceived health, life dissatisfaction, chronic disease and unhealthy habits among HCPs, and to explore the use of machine learning classification algorithms to remove selection bias. A sample of Spanish HCPs was asked to complete a web survey. Risk factors were identified by multivariate ordinal regression models. To counteract the absence of probabilistic sampling and representation, the sample was weighted by propensity score adjustment algorithms. The logistic regression algorithm was considered the most appropriate for dealing with misestimations. Male HCPs had significantly worse lifestyle habits than their female counterparts, together with a higher prevalence of chronic disease and of health problems. Members of the general population reported significantly poorer health and less satisfaction with life than the HCPs. Among HCPs, the prior existence of health problems was most strongly associated with worsening self-perceived health and decreased life satisfaction, while obesity had an important negative impact on female practitioners’ self-perception of health. Finally, the HCPs who worked as nurses had poorer self-perceptions of health than other HCPs, and the men who worked in primary care had less satisfaction with their lives than those who worked at other levels of healthcare. Ministerio de Ciencia e Innovación, Spain

    Estimating General Parameters from Non-Probability Surveys Using Propensity Score Adjustment

    This study introduces a general framework for inference on a general parameter using nonprobability survey data when a probability sample with auxiliary variables, common to both samples, is available. The proposed framework covers parameters from inequality measures and distribution function estimates, but the scope of the paper is broader. We develop a rigorous framework for general parameter estimation by solving survey-weighted estimating equations which involve propensity score estimation for the units in the non-probability sample. This development includes the expression of the variance estimator, as well as some alternatives which are discussed under the proposed framework. We carried out a simulation study using data from a real-world survey, in which the application of the estimation methods showed the effectiveness of the proposed design-based inference on several general parameters. Spanish Government [MTM2015-63609-R]. Instituto de Salud Carlos III. Spanish Government [PID2019-106861RB-I00-AEI-10.13039/50110001103]
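    For intuition, a weighted estimating equation for one such general parameter (a quantile) can be solved directly: find t such that the inverse-propensity-weighted sum of (1{y_i <= t} - alpha) is zero. The sketch assumes the participation propensities are known; in the paper they are estimated for the units in the non-probability sample.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy population: skewed target variable with true median 1.0;
# participation favours large y (the propensities are assumed known here).
y = rng.lognormal(0.0, 0.5, 1000)
pi = np.clip(1 / (1 + np.exp(-(y - 1))), 0.05, 0.95)
s = rng.random(1000) < pi            # volunteer sample indicator
w = 1 / pi[s]                        # inverse-propensity weights

def weighted_quantile(y, w, alpha):
    """Solve the weighted estimating equation
    sum_i w_i * (1{y_i <= t} - alpha) = 0 for t,
    i.e. invert the weighted empirical distribution function."""
    order = np.argsort(y)
    cum = np.cumsum(w[order]) / w.sum()
    return y[order][np.searchsorted(cum, alpha)]

median_hat = weighted_quantile(y[s], w, 0.5)
```

    Means and totals are the special case where the estimating function is linear in y; quantiles and inequality measures need the indicator-based equations the framework above is built for.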

    Combining Statistical Matching and Propensity Score Adjustment for Inference from Non-Probability Surveys

    The convenience of online surveys has quickly increased their popularity for data collection. However, this method is often non-probabilistic, as such surveys usually rely on self-selection procedures and are limited by internet coverage; these problems produce biased samples. To mitigate this bias, methods such as Statistical Matching and Propensity Score Adjustment (PSA) have been proposed. Both use a probabilistic reference sample with some covariates in common with the convenience sample. Statistical Matching trains a machine learning model on the convenience sample, which is then used to predict the target variable for the reference sample; these predicted values can be used to estimate population values. In PSA, both samples are used to train a model which estimates the propensity to participate in the convenience sample, and weights for the convenience sample are then calculated from those propensities. In this study, we propose methods to combine both techniques. The performance of each proposed method is tested by drawing nonprobability and probability samples from real datasets and using them to estimate population parameters. Ministerio de Economía y Competitividad of Spain
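    The abstract does not spell out the proposed combinations, but one standard way to join a matching model with PSA weights is a doubly robust-style estimator: average the matching predictions over the probability sample, then add a PSA-weighted mean of the convenience-sample residuals. The sketch below illustrates that idea on toy data; all sample sizes, models and the specific combination rule are assumptions, not the paper's methods.

```python
import numpy as np

rng = np.random.default_rng(7)

# Convenience (volunteer) sample with observed y; probability sample without y.
x_cv = rng.normal(1.0, 1.0, 600)
y_cv = 1.0 + 2.0 * x_cv + rng.normal(0, 1.0, 600)
x_pr = rng.normal(0.0, 1.0, 400)
w_pr = np.full(400, 25.0)            # design weights of the probability sample

# Matching component: prediction model fitted on the convenience sample.
b1, b0 = np.polyfit(x_cv, y_cv, 1)

def fit_logistic(X, z, lr=0.5, steps=3000):
    """Gradient-descent logistic regression on the pooled covariate."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(2)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ beta))
        beta += lr * Xb.T @ (z - p) / len(z)
    return beta

# PSA component: propensity model on the pooled sample, weights for volunteers.
X = np.concatenate([x_pr, x_cv])
z = np.concatenate([np.zeros(400), np.ones(600)])
beta = fit_logistic(X, z)
pi = 1 / (1 + np.exp(-(beta[0] + beta[1] * x_cv)))
w_cv = (1 - pi) / pi

# Combination: matching predictions on the probability sample, corrected by
# a PSA-weighted mean of the convenience-sample residuals.
mean_match = np.average(b0 + b1 * x_pr, weights=w_pr)
resid_term = np.sum(w_cv * (y_cv - (b0 + b1 * x_cv))) / np.sum(w_cv)
est = mean_match + resid_term
```

    The appeal of this style of combination is that either a good prediction model or good propensities is enough for the residual correction to stay near zero.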